However, if the conditions in Eq. (3.148) are met, then together with the conclusion of Eq. (3.145) the gradient of \(\hat{w}^{t+1}\) is formulated as:
\[
\frac{\partial \mathcal{L}}{\partial \hat{w}^{t+1}} = \frac{\partial \mathcal{L}}{\partial \hat{w}^{t}} - \eta\,\frac{\partial^{2} \mathcal{L}}{\partial (\hat{w}^{t})^{2}} \geq \gamma,
\qquad
\eta\,\frac{\partial^{2} \mathcal{L}}{\partial (\hat{w}^{t})^{2}} \leq \frac{\partial \mathcal{L}}{\partial \hat{w}^{t}} - \gamma \leq -2\gamma.
\tag{3.149}
\]
Note that \(\eta\) and \(\gamma\) are both positive, so the second-order gradient \(\frac{\partial^{2} \mathcal{L}}{\partial (\hat{w}^{t})^{2}} < 0\) always holds. Consequently, \(\mathcal{L}(\hat{w}^{t+1})\) can only be a local maximum rather than a minimum, which contradicts the convergence of the training process. This contradiction indicates that, owing to the additional term \(S(\alpha, w)\), the training algorithm converges once the frequent oscillation stops. Therefore, we complete the proof.
□
Our proposition and proof reveal that the balanced parameter \(\gamma\) acts as a “threshold”: a threshold that is too small fails to mitigate the frequent oscillation effectively, while one that is too large suppresses necessary sign inversions and hinders the gradient descent process.
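To make this gating behavior concrete, the following is a minimal sketch (not the method itself), assuming the condition implied by Eq. (3.148) amounts to accepting a sign inversion of a latent weight only when its gradient magnitude exceeds \(\gamma\); the function name and learning rate below are hypothetical.

```python
import torch

def gated_sign_flip(w_hat, grad, gamma, lr=0.01):
    """Toy illustration of gamma acting as a flip "threshold".

    Hypothetical helper: a gradient step on the latent weights w_hat is kept
    as-is unless it would flip a weight's sign while the gradient magnitude is
    no larger than gamma, in which case that weight is left unchanged.
    """
    update = -lr * grad
    would_flip = torch.sign(w_hat + update) != torch.sign(w_hat)
    strong_grad = grad.abs() > gamma
    # Suppress sign inversions that are not backed by a sufficiently large gradient.
    update = torch.where(would_flip & ~strong_grad, torch.zeros_like(update), update)
    return w_hat + update
```

Under this view, a single fixed \(\gamma\) must trade off oscillation damping against the freedom to flip signs.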
To solve this, we devise the learning rule of γ as:
\[
\gamma_i^{n,t+1} = \frac{1}{M^n}\left\| \mathbf{b}_{w_i}^{n,t} \circledast \mathbf{b}_{w_i}^{n,t+1} - 1 \right\|_0 \cdot \max_{1 \le j \le M^n}\left( \left| \frac{\partial \mathcal{L}}{\partial \hat{w}_{i,j}^{n,t}} \right| \right),
\tag{3.150}
\]
where the first factor \(\frac{1}{M^n}\big\| \mathbf{b}_{w_i}^{n,t} \circledast \mathbf{b}_{w_i}^{n,t+1} - 1 \big\|_0\) denotes the proportion of weights whose sign changes, and the second factor \(\max_{1 \le j \le M^n}\big( \big| \frac{\partial \mathcal{L}}{\partial \hat{w}_{i,j}^{n,t}} \big| \big)\), derived from Eq. (3.148), is the gradient with the greatest magnitude at the \(t\)-th iteration. In this way, we suppress frequent weight oscillation with a resilient gradient.
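For clarity, the rule in Eq. (3.150) can be sketched in a few lines of PyTorch; the tensor shapes, the per-channel layout, and the function name below are assumptions made for illustration.

```python
import torch

def update_gamma(b_w_prev, b_w_curr, grad_w_hat):
    """Sketch of Eq. (3.150) for one layer n.

    b_w_prev, b_w_curr : binarized weights (+1/-1) at iterations t and t+1,
                         shape (C_out, M^n), one row per output channel i.
    grad_w_hat         : gradient of the loss w.r.t. the latent weights at
                         iteration t, same shape.
    Returns a per-channel gamma of shape (C_out,).
    """
    M = b_w_prev.shape[1]
    # b^t * b^{t+1} is -1 exactly where a sign flipped, so the L0 norm of
    # (b^t * b^{t+1} - 1) counts the flipped weights in each channel.
    flip_ratio = (b_w_prev * b_w_curr - 1.0).ne(0).float().sum(dim=1) / M
    # Largest intra-channel gradient magnitude of the t-th iteration.
    max_grad = grad_w_hat.abs().amax(dim=1)
    return flip_ratio * max_grad
```

Computing \(\gamma\) per output channel keeps the threshold commensurate with that channel's own gradient scale, matching the layer- and channel-wise variation discussed in the ablation study below.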
We further optimize the scaling factor as follows:
\[
\delta_{\alpha_i^n} = \frac{\partial \mathcal{L}}{\partial \alpha_i^n} + \frac{\partial \mathcal{L}_R}{\partial \alpha_i^n}.
\tag{3.151}
\]
The gradient derived from the softmax loss can be calculated directly by backpropagation. Based on Eq. (6.88), it is easy to derive:
\[
\frac{\partial \mathcal{L}_R}{\partial \alpha_i^n} = \gamma_i^n \left( w_i^n - \alpha_i^n \mathbf{b}_{w_i}^n \right) \circledast \mathbf{b}_{w_i}^n.
\tag{3.152}
\]
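As a rough sketch, the scaling-factor update of Eqs. (3.151)–(3.152) might be implemented as follows; the reduction of \(\circledast\) over the channel dimension and all names and shapes are assumptions made for illustration.

```python
import torch

def scaling_factor_grad(w, alpha, b_w, gamma, grad_alpha_from_loss):
    """Sketch of Eqs. (3.151)-(3.152) for the per-channel scaling factors.

    w                    : latent real-valued weights, shape (C_out, M^n)
    alpha                : scaling factors, shape (C_out,)
    b_w                  : binarized weights (+1/-1), shape (C_out, M^n)
    gamma                : balanced parameters, shape (C_out,)
    grad_alpha_from_loss : dL/d(alpha) from ordinary backpropagation, shape (C_out,)
    """
    # Eq. (3.152): dL_R/d(alpha_i) = gamma_i * (w_i - alpha_i * b_w_i) (*) b_w_i,
    # where the element-wise product is assumed to be reduced over the channel.
    grad_from_reg = gamma * ((w - alpha.unsqueeze(1) * b_w) * b_w).sum(dim=1)
    # Eq. (3.151): combine the two gradient terms.
    return grad_alpha_from_loss + grad_from_reg
```

The returned \(\delta_{\alpha}\) can then be fed to any standard optimizer step for the scaling factors.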
3.9.3 Ablation Study
Since our ReBNN does not introduce additional hyperparameters, we first evaluate different ways of calculating \(\gamma\). Then we show how ReBNN achieves a resilient training process. In the ablation study, we used a ResNet-18 backbone initialized from the first-stage training with W32A1, following [158].
Calculation of γ: We compare different calculations of \(\gamma\) in this part. As shown in Table 3.7, the performance first increases and then decreases as the value of the constant \(\gamma\) grows. Considering that the magnitude of the gradient varies across both layers and channels, a suitable \(\gamma\) can hardly be set manually as a single global value. We further compare the gradient-based calculation. As shown in the bottom rows, we first use \(\max_{1 \le j \le M^n}\big( \big| \frac{\partial \mathcal{L}}{\partial \hat{w}_{i,j}^{n,t}} \big| \big)\), the maximum intra-channel gradient of the last iteration, which performs similarly to the constant 1e−4. This indicates that only using the maximum intra-channel gradient may suppress necessary